Description

This is a template for exercise 6 in Chapter 2 of Bit By Bit: Social Research in the Digital Age by Matt Salganik. The problem is reprinted here with some additional comments and structure to facilitate a solution.

The original problem statement:

In a widely discussed paper, Michel and colleagues (2011) analyzed the content of more than five million digitized books in an attempt to identify long-term cultural trends. The data that they used has now been released as the Google NGrams dataset, and so we can use the data to replicate and extend some of their work.

In one of the many results in the paper, Michel and colleagues argued that we are forgetting faster and faster. For a particular year, say “1883,” they calculated the proportion of 1-grams published in each year between 1875 and 1975 that were “1883”. They reasoned that this proportion is a measure of the interest in events that happened in that year. In their figure 3a, they plotted the usage trajectories for three years: 1883, 1910, and 1950. These three years share a common pattern: little use before that year, then a spike, then decay. Next, to quantify the rate of decay for each year, Michel and colleagues calculated the “half-life” of each year for all years between 1875 and 1975. In their figure 3a (inset), they showed that the half-life of each year is decreasing, and they argued that this means that we are forgetting the past faster and faster. They used Version 1 of the English language corpus, but subsequently Google has released a second version of the corpus. Please read all the parts of the question before you begin coding.

This activity will give you practice writing reusable code, interpreting results, and data wrangling (such as working with awkward files and handling missing data). This activity will also help you get up and running with a rich and interesting dataset.

The full paper can be found here, and this is the original figure 3a that you’re going to replicate:

Part A

Get the raw data from the Google Books NGram Viewer website. In particular, you should use version 2 of the English language corpus, which was released on July 1, 2012. Uncompressed, this file is 1.4GB.

Get and clean the raw data

Edit the 01_download_1grams.sh file to download the googlebooks-eng-all-1gram-20120701-1.gz file and the 02_filter_1grams.sh file to filter the original 1gram file to only lines where the ngram matches a year (output to a file named year_counts.tsv).

Then edit the 03_download_totals.sh file to down the googlebooks-eng-all-totalcounts-20120701.txt and file and the 04_reformat_totals.sh file to reformat the total counts file to a valid csv (output to a file named total_counts.csv).

Load the cleaned data

Load in the year_counts.tsv and total_counts.csv files. Use the here() function around the filename to keep things portable.Give the columns of year_counts.tsv the names term, year, volume, and book_count. Give the columns of total_counts.csv the names year, total_volume, page_count, and book_count. Note that column order in these files may not match the examples in the documentation.

year_counts <- read_tsv('year_count.tsv', col_names = c("term", "year", "volumn", "book_count"))
## Parsed with column specification:
## cols(
##   term = col_double(),
##   year = col_double(),
##   volumn = col_double(),
##   book_count = col_double()
## )
total_counts <- read_csv('total_count.csv', col_names = c("year", "total_volume", "page_count", "book_count"))
## Parsed with column specification:
## cols(
##   year = col_double(),
##   total_volume = col_double(),
##   page_count = col_double(),
##   book_count = col_double()
## )
head(year_counts)
head(total_counts)
str(year_counts)
## tibble [53,393 x 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ term      : num [1:53393] 1817 1817 1817 1817 1817 ...
##  $ year      : num [1:53393] 1524 1575 1607 1637 1662 ...
##  $ volumn    : num [1:53393] 31 17 3 2 1 5 8 1 1 1 ...
##  $ book_count: num [1:53393] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   term = col_double(),
##   ..   year = col_double(),
##   ..   volumn = col_double(),
##   ..   book_count = col_double()
##   .. )

Your written answer

Total number of lines in year_count.tsv is 53393 Total number of lines in total_count.csv is 425

Part B

Recreate the main part of figure 3a of Michel et al. (2011). To recreate this figure, you will need two files: the one you downloaded in part (a) and the “total counts” file, which you can use to convert the raw counts into proportions. Note that the total counts file has a structure that may make it a bit hard to read in. Does version 2 of the NGram data produce similar results to those presented in Michel et al. (2011), which are based on version 1 data?

Join ngram year counts and totals

Join the raw year term counts with the total counts and divide to get a proportion of mentions for each term normalized by the total counts for each year.

#joined the year_counts with total counts and calculated the proportion
joined_year_counts <- year_counts %>% 
  filter( term == c(1883,1910,1950))%>% 
  left_join(total_counts, by = "year") %>% 
  mutate(proportion = volumn/total_volume)
## Warning in term == c(1883, 1910, 1950): longer object length is not a multiple
## of shorter object length

Plot the main figure 3a

Plot the proportion of mentions for the terms “1883”, “1910”, and “1950” over time from 1850 to 2012, as in the main figure 3a of the original paper. Use the percent function from the scales package for a readable y axis. Each term should have a different color, it’s nice if these match the original paper but not strictly necessary.

# change the type of term as factor so I can color it by term (the original data has term as number)
joined_year_counts$term <- factor(joined_year_counts$term)

joined_year_counts %>% 
  group_by(term) %>% 
  ggplot(aes(x = year, y = proportion, color = term)) +
  geom_line()+
  xlim(c(1850,2012))+
  scale_y_continuous(label = percent)+
  scale_colour_manual(values=c("blue", "green","red")) + # used to sepcify the color of the lines
  ylab('Frequency')+
  xlab('Year')
## Warning: Removed 116 row(s) containing missing values (geom_path).

Your written answer

The graphs produced based on version 1 adn version 2 are similar.

Part C

Now check your graph against the graph created by the NGram Viewer.

Compare to the NGram Viewer

Go to the ngram viewer, enter the terms “1883”, “1910”, and “1950” and take a screenshot.

Your written answer

Graph from ngram viewer

Graph from ngram viewer

Part D

Recreate figure 3a (main figure), but change the y-axis to be the raw mention count (not the rate of mentions).

Plot the main figure 3a with raw counts

Plot the raw counts for the terms “1883”, “1910”, and “1950” over time from 1850 to 2012. Use the comma function from the scales package for a readable y axis. The colors for each term should match your last plot, and it’s nice if these match the original paper but not strictly necessary.

joined_year_counts %>%
  ggplot(aes(x = year, y = volumn, color = term)) +
  geom_line()+
  xlim(c(1850,2012))+
  scale_y_continuous(label = comma)+
  ylab('Count')+
  xlab('Year')
## Warning: Removed 116 row(s) containing missing values (geom_path).

Part E

Does the difference between (b) and (d) lead you to reevaluate any of the results of Michel et al. (2011). Why or why not?

As part of answering this question, make an additional plot.

Plot the totals

Plot the total counts for each year over time, from 1850 to 2012. Use the comma function from the scales package for a readable y axis. There should be only one line on this plot (not three).

ggplot(total_counts, aes(x = year, y = total_volume)) + 
  geom_line()+
  scale_y_continuous(label = comma)

Your written answer

Write up your answer to Part E here.

Part F

Now, using the proportion of mentions, replicate the inset of figure 3a. That is, for each year between 1875 and 1975, calculate the half-life of that year. The half-life is defined to be the number of years that pass before the proportion of mentions reaches half its peak value. Note that Michel et al. (2011) do something more complicated to estimate the half-life—see section III.6 of the Supporting Online Information—but they claim that both approaches produce similar results. Does version 2 of the NGram data produce similar results to those presented in Michel et al. (2011), which are based on version 1 data? (Hint: Don’t be surprised if it doesn’t.)

Compute peak mentions

For each year term, find the year where its proportion of mentions peaks (hits its highest value). Store this in an intermediate dataframe.

all_year_counts <- year_counts %>% 
  filter(year >=1800) %>% 
  left_join(total_counts, by = "year") %>% 
  mutate(proportion = volumn/total_volume)

peak<- all_year_counts %>% 
  group_by(term) %>% 
  filter(proportion == max(proportion)) %>% 
  select(term, year, proportion)%>% 
  arrange(term)

peak

Compute half-lifes

Now, for each year term, find the minimum number of years it takes for the proportion of mentions to decline from its peak value to half its peak value. Store this in an intermediate data frame.

peak_half <- peak %>% 
  mutate(half_life = proportion/2) %>% 
  select(term,half_life) %>% arrange(term)

peak_for_join <- peak %>% rename(peak_year = year) %>% select(term, peak_year)
# now we know the value  for half_life, need to find the year corresponse to it
half_life_year <- all_year_counts %>% 
  left_join(peak_half, by = "term") %>% 
  left_join(peak_for_join, by = "term") %>%
  filter(year >  peak_year & proportion<=half_life) %>% 
    select(term, year, proportion, half_life, peak_year) %>% 
  group_by(term) %>% 
  summarise(half_life_year = min(year)) %>%
  left_join(peak_for_join, by="term") %>% 
  select(term, half_life_year, peak_year) %>% 
  mutate(half_year = half_life_year - peak_year)
half_life_year %>% View

Plot the inset of figure 3a

Plot the half-life of each term over time from 1850 to 2012. Each point should represent one year term, and add a line to show the trend using geom_smooth().

ggplotly(ggplot(half_life_year, aes(x = term, y = half_year)) +
  geom_point())
  # geom_point(data=half_life_year[term == c(1883,1910,1950)], aes(x=term, y=half_year), colour="red", size=5)

Your written answer

Write up your answer to Part F here.

Part G

Were there any years that were outliers such as years that were forgotten particularly quickly or particularly slowly? Briefly speculate about possible reasons for that pattern and explain how you identified the outliers.

Your written answer

Write up your answer to Part G here. Include code that shows the years with the smallest and largest half-lifes.

Makefile

Edit the Makefile in this directory to execute the full set of scripts that download the data, clean it, and produce this report. This must be turned in with your assignment such that running make on the command line produces the final report as a pdf file.